533 research outputs found

    Bridging Dense and Sparse Maximum Inner Product Search

    Maximum inner product search (MIPS) over dense and sparse vectors has progressed independently in a bifurcated literature for decades; the latter is better known as top-k retrieval in Information Retrieval. This duality exists because sparse and dense vectors serve different end goals, even though they are manifestations of the same mathematical problem. In this work, we ask if algorithms for dense vectors could be applied effectively to sparse vectors, particularly those that violate the assumptions underlying top-k retrieval methods. We study IVF-based retrieval, where vectors are partitioned into clusters and only a fraction of the clusters are searched during retrieval. We conduct a comprehensive analysis of dimensionality reduction for sparse vectors, and examine standard and spherical KMeans for partitioning. Our experiments demonstrate that IVF serves as an efficient solution for sparse MIPS. As byproducts, we identify two research opportunities and demonstrate their potential. First, we cast the IVF paradigm as a dynamic pruning technique and turn that insight into a novel organization of the inverted index for approximate MIPS for general sparse vectors. Second, we offer a unified regime for MIPS over vectors that have dense and sparse subspaces, and show its robustness to query distributions.
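The IVF idea described in this abstract (partition the vectors into clusters, then search only a fraction of the clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the dict-based sparse representation, the toy spherical KMeans, and the names `dot`, `normalize`, `spherical_kmeans`, `build_ivf`, `search`, and `nprobe` are all assumptions made for illustration, and the dimensionality reduction step mentioned in the abstract is omitted.

```python
import random
from collections import defaultdict

def dot(u, v):
    """Inner product of two sparse vectors stored as {dimension: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(val * v.get(dim, 0.0) for dim, val in u.items())

def normalize(v):
    """Scale a sparse vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = dot(v, v) ** 0.5
    return {d: x / norm for d, x in v.items()} if norm > 0 else dict(v)

def spherical_kmeans(vectors, k, iters=10, seed=0):
    """Toy spherical KMeans: assign by cosine similarity, re-normalize centroids."""
    rng = random.Random(seed)
    unit = [normalize(v) for v in vectors]
    centroids = rng.sample(unit, k)  # hypothetical initialization: k random points
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [max(range(k), key=lambda c: dot(centroids[c], u)) for u in unit]
        sums = [defaultdict(float) for _ in range(k)]
        for u, c in zip(unit, assign):
            for d, x in u.items():
                sums[c][d] += x
        # Keep the old centroid if a cluster ended up empty.
        centroids = [normalize(s) if s else centroids[c] for c, s in enumerate(sums)]
    return centroids, assign

def build_ivf(vectors, k, **kmeans_args):
    """Partition vector indices into k clusters; returns (centroids, clusters)."""
    centroids, assign = spherical_kmeans(vectors, k, **kmeans_args)
    clusters = [[] for _ in range(k)]
    for idx, c in enumerate(assign):
        clusters[c].append(idx)
    return centroids, clusters

def search(query, vectors, centroids, clusters, top_k=1, nprobe=1):
    """Probe the nprobe clusters whose centroids score highest against the query,
    then rank only their members by exact inner product."""
    order = sorted(range(len(centroids)),
                   key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in order[:nprobe] for i in clusters[c]]
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:top_k]
```

In this sketch, setting `nprobe` equal to the number of clusters degenerates to exhaustive search, while smaller values skip clusters and make the result approximate; that is the accuracy versus cost trade-off the abstract refers to.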

    An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors

    Maximum Inner Product Search, or top-k retrieval, on sparse vectors is well understood in information retrieval, with a number of mature algorithms that solve it exactly. However, all existing algorithms are tailored to text and frequency-based similarity measures. To achieve optimal memory footprint and query latency, they rely on the near stationarity of documents and on laws governing natural languages. We consider, instead, a setup in which collections are streaming, necessitating dynamic indexing, and where indexing and retrieval must work with arbitrarily distributed real-valued vectors. As we show, existing algorithms are no longer competitive in this setup, even against naive solutions. We investigate this gap and present a novel approximate solution, called Sinnamon, that can efficiently retrieve the top-k results for sparse real-valued vectors drawn from arbitrary distributions. Notably, Sinnamon offers levers to trade off memory consumption, latency, and accuracy, making the algorithm suitable for constrained applications and systems. We give theoretical results on the error introduced by the approximate nature of the algorithm, and present an empirical evaluation of its performance on two hardware platforms and synthetic and real-valued datasets. We conclude by laying out concrete directions for future research on this general top-k retrieval problem over sparse vectors.

    TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank

    Learning-to-Rank deals with maximizing the utility of a list of examples presented to the user, with items of higher relevance being prioritized. It has several practical applications such as large-scale search, recommender systems, document summarization and question answering. While there is widespread support for classification and regression based learning, support for learning-to-rank in deep learning has been limited. We propose TensorFlow Ranking, the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly configurable and provides easy-to-use APIs to support different scoring mechanisms, loss functions and evaluation metrics in the learning-to-rank setting. Our library is developed on top of TensorFlow and can thus fully leverage the advantages of this platform. For example, it is highly scalable, both in training and in inference, and can be used to learn ranking models over massive amounts of user activity data, which can include heterogeneous dense and sparse features. We empirically demonstrate the effectiveness of our library in learning ranking functions for large-scale search and recommendation applications in Gmail and Google Drive. We also show that ranking models built using our model scale well for distributed training, without significant impact on metrics. The proposed library is available to the open source community, with the hope that it facilitates further academic research and industrial applications in the field of learning-to-rank. Comment: KDD 201

    Efficiency and timing performance of the MuPix7 high-voltage monolithic active pixel sensor

    The MuPix7 is a prototype high-voltage monolithic active pixel sensor with 103 × 80 µm² pixels thinned to 64 µm and incorporating the complete read-out circuitry including a 1.25 Gbit/s differential data link. Using data taken at the DESY electron test beam, we demonstrate an efficiency of 99.3% and a time resolution of 14 ns. The efficiency and time resolution are studied with sub-pixel resolution and reproduced in simulations. Comment: 7 pages, 13 figures, submitted to Nucl.Instr.Meth.

    Drug-microenvironment perturbations reveal resistance mechanisms and prognostic subgroups in CLL

    The tumour microenvironment and genetic alterations collectively influence drug efficacy in cancer, but current evidence is limited and systematic analyses are lacking. Using chronic lymphocytic leukaemia (CLL) as a model disease, we investigated the influence of 17 microenvironmental stimuli on 12 drugs in 192 genetically characterised patient samples. Based on microenvironmental response, we identified four subgroups with distinct clinical outcomes beyond known prognostic markers. Response to multiple microenvironmental stimuli was amplified in trisomy 12 samples. Trisomy 12 was associated with a distinct epigenetic signature. Bromodomain inhibition reversed this epigenetic profile and could be used to target microenvironmental signalling in trisomy 12 CLL. We quantified the impact of microenvironmental stimuli on drug response and their dependence on genetic alterations, identifying interleukin 4 (IL4) and Toll-like receptor (TLR) stimulation as the strongest actuators of drug resistance. IL4 and TLR signalling activity was increased in CLL-infiltrated lymph nodes compared with healthy samples. High IL4 activity correlated with faster disease progression. The publicly available dataset can facilitate the investigation of cell-extrinsic mechanisms of drug resistance and disease progression.